Primary Vignette - Classification Strategies

Objectives

  • Perform exploratory data analysis

  • Reduce dimensionality using principal component analysis (PCA)

  • Employ logistic regression using glm() and multinom()

  • Build a random forest model using cross-validation

We’ll illustrate these strategies using the California Household Travel Survey (CHTS) dataset. We are interested in classifying the block group density variable bg_group which takes on values urban, suburban, exurban, and rural.

Data Cleaning

First, we want to load any necessary packages and our 3 datasets into R Studio. We will then merge all 3 datasets into 1 dataframe and convert variables into factors as needed while also filtering out numeric variables that have numbers that represent NAs.

# Loading necessary packages
library(tidyverse)
library(ggplot2)
library(dplyr)
library(tidymodels)
library(sparsesvd)
library(nnet)
library(sf)
library(mapview)
mapviewOptions(fgb = FALSE)
library(leafsync)
library(maps)
library(nnet)
library(randomForest)
library(vip)
library(SparseM)
library(Matrix)
library(kableExtra)

# Read in datasets
PersonData <- read_rds('./Data/raw/PersonData_111A.Rds')
HHData <- read_rds('./Data/raw/HHData_111A.Rds')
hh_bgDensity <- read_rds('./Data/raw/hh_bgDensity.Rds')
county_shp <- st_read("./Data/raw/counties/counties.shp")
Reading layer `counties' from data source 
  `/Users/valeriedelafuente/Documents/Capstone/vignette-group3/Data/raw/counties/counties.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 58 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.3926 ymin: 32.53578 xmax: -114.1312 ymax: 42.00952
Geodetic CRS:  WGS 84
# Merge datasets
personHHData <- left_join(PersonData, HHData) %>%
  left_join(hh_bgDensity)

data <- personHHData  %>% 
  filter(WorkDaysWk != 8 , WorkDaysWk !=9,
         TypicalHoursWk != 998, TypicalHoursWk!= 999,
         TransitTripsWk != 98, TransitTripsWk != 99,
         WalkTripsWk != 98, WalkTripsWk != 99,
         BikeTripsWk != 98, BikeTripsWk != 99) %>% 
  mutate(HH_anyTransitRider = as.factor(HH_anyTransitRider),
         HH_homeType = as.factor(HH_homeType),
         HH_homeowner = as.factor(HH_homeowner),
         HH_isHispanic = as.factor(HH_isHispanic),
         HH_intEnglish = as.factor(HH_intEnglish),
         HH_income = as.factor(HH_income),
         Male = as.factor(Male),
         persWhite = as.factor(persWhite),
         persHisp = as.factor(persHisp),
         persAfricanAm = as.factor(persAfricanAm),
         persNativeAm = as.factor(persNativeAm),
         persAsian = as.factor(persAsian),
         persPacIsl = as.factor(persPacIsl),
         persOthr = as.factor(persOthr),
         persDKrace = as.factor(persDKrace),
         persRFrace = as.factor(persRFrace),
         bornUSA = as.factor(bornUSA),
         DriverLic = as.factor(DriverLic),
         TransitPass = as.factor(TransitPass),
         Employed = as.factor(Employed),
         WorkFixedLoc = as.factor(WorkFixedLoc),
         WorkHome =as.factor(WorkHome),
         WorkNonfixed = as.factor(WorkNonfixed),
         FlexSched = as.factor(FlexSched),
         FlexPrograms = as.factor(FlexPrograms),
         Disability = as.factor(Disability),
         DisLicensePlt = as.factor(DisLicensePlt),
         Student = as.factor(Student),
         workday = as.factor(workday),
         hhid = paste(hhid, pnum),
         SchoolMode = as.factor(SchoolMode),
         WorkMode = as.factor(WorkMode),
         EducationCompl = as.factor(EducationCompl)) %>% 
  select(-pnum)

Next, we will preprocess our data to be ready for PCA by selecting only the numeric columns and scaling the data. Then, we will want to add back in our response variable column, bg_group, and our ID column, hhid, to the scaled data. Lastly, we want to remove any NA values from the data, since PCA cannot handle them.

# Preprocess data
numeric_columns <- sapply(data, is.numeric)

numeric_data <- data[, numeric_columns] %>% select(-CTFIP, -bg_density)

scaled_data <- scale(numeric_data)

hhid <- data$hhid
bg_group <- as.data.frame(data$bg_group)
scaled_data <- cbind(hhid, bg_group, scaled_data) %>% 
  mutate(bg_group = data$bg_group)

scaled_data_clean <- na.omit(scaled_data) %>% 
  as.data.frame() %>% 
  select(-`data$bg_group`)

Exploratory Data Analysis

Before we partition our data and fit classification models, we want to get to know our data through some exploratory visualizations. Since the data is geographically structured, we found it helpful to visualize it on a map. Since we are interested in classifying household density categories, we chose 2 variables, `Sum_PMT` and `Sum_Trips`, which intuitively may be significant in predicting our categories.

Total Distance by Residential Area

county_bg_aggreg <- personHHData %>% 
  group_by(County, CTFIP, bg_group) %>%  # group by county, CTFIP, and also bg_group
  mutate(count = n()) %>% 
  summarise_at(vars(-hhid), mean)

county_bg_shp <- county_shp %>% 
  merge(data.frame(bg_group = c("Urban", "Suburban", "Exurban", "Rural"))) %>% 
  left_join(county_bg_aggreg)

# get the CA county data
county <- ggplot2::map_data("county", region = "california")

county_bg <- merge(county, data.frame(bg_group = c("Urban", "Suburban", "Exurban", "Rural")))

county_bg_all <- county_bg_aggreg %>% 
  mutate(subregion = tolower(County)) %>% 
  full_join(county_bg, by = c("subregion", "bg_group"))

ggplot(county_bg_all) +
  geom_polygon(aes(x = long, y = lat, group = subregion, fill = Sum_PMT), colour = "white") +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  facet_wrap(vars(bg_group), nrow = 2) +  
  ggtitle("Total PMT in California at County-level") + 
  theme_void() +
  theme(legend.position="bottom")

This map shows the total distance traveled by county for each of the categories. The grey counties represent missing data. There tends to be more travel in rural or exurban areas compared to suburban or urban areas. This makes sense as employees in those areas may find themselves needing to compute to cities for work and thus travel longer distances.

Total Number of Trips by Residential Area

urban_TripMap <-  mapview(filter(county_bg_shp, bg_group == "Urban"),
                          zcol = "Sum_Trips", legend = TRUE, popup = NULL,
                          layer.name = "Urban Trips")

suburb_TripMap <- mapview(filter(county_bg_shp, bg_group == "Suburban"),
                          zcol = "Sum_Trips", legend = TRUE, popup = NULL,
                          layer.name = "Suburban Trips")

exurb_TripMap <- mapview(filter(county_bg_shp, bg_group == "Exurban"),
                         zcol = "Sum_Trips", legend = TRUE, popup = NULL,
                         layer.name = "Exurban Trips")

rural_TripMap <- mapview(filter(county_bg_shp, bg_group == "Rural"),
                         zcol = "Sum_Trips", legend = TRUE, popup = NULL,
                         layer.name = "Rural Trips")

latticeview(urban_TripMap, suburb_TripMap, exurb_TripMap, rural_TripMap, sync = "all")

This dynamic visual represents the number of trips per county by household density categories (urban, suburban, exurban, and rural). The trend is that suburban and exurban areas tend to have lower number of trips taken. This intuitively also makes sense as those places may find themselves needing to combine trips since they are not close enough to the city.

Based on our EDA, we plan to include these two variables in our predictive feature set. However, we will later analyze the PCA loadings to assess their contributions to the principal components and validate this decision.

Data Partitioning

Before fitting models, it is imperative to split our data into a training set and a testing set. We will use the training set to train our models. Once our models are trained, we will fit the best model to the testing set and see how it truly performs. We are employing the 80/20 split, where 80% of the data goes to the training data and 20% of the data goes to the testing data. This ensures that there is data to both train our models and properly test model performance. set.seed() is used to ensure reproducibility of our findings.

# Set seed
set.seed(14531)

data_clean  <- data.frame(lapply(scaled_data_clean, as.numeric))
# Partition data
partitions <- scaled_data_clean %>% 
  initial_split(prop = 0.8)

# Separate id and response variable in testing and training data
test_features <- testing(partitions) %>% 
  select(-hhid, -bg_group)
test_labels <- testing(partitions) %>% 
  select(hhid, bg_group)

train_features <- training(partitions) %>% 
  select(-hhid, -bg_group)
train_labels <- training(partitions) %>%
  select(hhid, bg_group)

train_features <- data.frame(lapply(train_features, as.numeric))
test_features  <- data.frame(lapply(test_features, as.numeric))

Principal Component Analysis

To reduce the dimensionality of the dataset, we employ PCA.

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into fewer uncorrelated components (principal components) while retaining most of the variance.

PCA can be used for:

- Dimensionality Reduction: improves model efficiency by reducing the number of features

- Noise Filtering: discards low-variance features that don’t add much value

- Visualization: projects high-dimensional data into 2D or 3D for better pattern recognition

PCA Limitations:

- Captures only linear patterns

- New components may be harder to interpret since they combine original features

Step 1: Perform PCA on training data

# Set seed
set.seed(14531)

# Ensure the data is numeric and convert to a matrix
train_features_matrix <- as.matrix(train_features)
test_features_matrix <- as.matrix(test_features)

# Convert training data to sparse matrix to use sparsesvd function to perform PCA
train_features_sparse <- as(train_features_matrix, "sparseMatrix")

# Perform PCA on sparse training data matrix and turn into dataframe
train_svd <- sparsesvd(train_features_sparse, rank = 18)
training_projected <- as.data.frame(train_svd$u %*% diag(train_svd$d))

# Assign column names
colnames(training_projected) <- paste0("PC", 1:ncol(training_projected))

Step 2: Visualizing variance explained by principal components

First, determine the optimal number of principal components to retain based on their respective variance. We set a threshold of 80%, ensuring the selected PCs capture at least 80% of the total variance in the dataset.

We will visualize the transformed data to assess patterns in the reduced feature space.

# Explained variance plot
singular_values <- train_svd$d
variance_explained <- (singular_values^2) / sum(singular_values^2)

plot(variance_explained, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", main = "Scree Plot")
abline(h = 0.1, col = "red", lty = 2)

# Cumulative variance plot
cumulative_variance <- cumsum(variance_explained)

plot(cumulative_variance, type = "b", xlab = "Principal Component",
     ylab = "Cumulative Variance Explained", main = "Cumulative Variance")

# Set the threshold for the cumulative variance (80%)
threshold <- 0.8
reduced_pcs <- which(cumulative_variance >= threshold)[1]

# Print the number of PCs to keep
cat("Number of PCs to retain:", reduced_pcs)
Number of PCs to retain: 12
# Plot PC1 vs PC2
plot(training_projected$PC1, training_projected$PC2, xlab = "PC1", ylab = "PC2",
     main = "PCA - PC1 vs PC2", pch = 19, col = "blue")

Step 3: Project testing data onto training PCA

This step ensures consistency between training and testing data representations for subsequent modeling.

# Set seed
set.seed(14531)

# Define function
reproject_fn <- function(.dtm, train_projected) {
  .dtm_sparse <- as(.dtm, "sparseMatrix")
  test_projected <- as.matrix(.dtm_sparse %*% train_projected$v %*% diag(1 / train_projected$d))
  colnames(test_projected) <- paste0("PC", 1:ncol(test_projected))
  return(test_projected)
}

# Project
test_projected <- reproject_fn(test_features_matrix, train_svd)

Now, the dataset has been transformed.

Multinomial Logistic Regression

Now that we have reduced the dimensionality of the dataset, we can feed the transformed data into the logistic regression model.

Multinomial logistic regression is an extension of binary logistic regression designed to handle multiple classes. This model works by estimating the log-odds of the different classes relative to a baseline class, and for each observation, it predicts the class whose log-odds of occurrence is the highest.

Benefits of Logistic Regression:

- Simplicity and Interpretability: the coefficients can give insights into the relationship between the features and the predicted class

- Probabilistic Output: provides probabilities useful for decision-making

- Efficiency: doesn’t require too much computational power compared to more complex models

Challenges of Logistic Regression:

- Class Overlap: if the different classes overlap heavily in the reduced feature space, the linear decision boundary assumed by logistic regression might not be effective in separating the classes

- Model Limitations: Logistic regression assumes that the predictors are independent and linearly related to the log-odds of the predictor variable

Step 1: Get the reduced PCA data that you will feed into logistic regression model

# Select 12 PCs for training and testing data
training_projected_reduced <- training_projected[, 1:reduced_pcs]
test_projected_reduced <- test_projected[, 1:reduced_pcs]

# Add response variable, bg_group, back to PCs in training
reduced_training <- cbind(training_projected_reduced, bg_group = train_labels$bg_group)

# Add response variable, bg_group, back to PCs in training
reduced_testing <- cbind(as.data.frame(test_projected_reduced), bg_group = test_labels$bg_group)

Step 2: Fit logistic regression model with the PCA reduced training data

# fit model to predict bg_group using all predictors on PCA reduced training data
log_regmodel <- multinom(bg_group ~ ., data = reduced_training)
# weights:  56 (39 variable)
initial  value 32694.366213 
iter  10 value 31938.099816
iter  20 value 31717.082184
iter  30 value 31700.157709
iter  40 value 31684.609508
final  value 31678.690897 
converged

Let’s see how well our model was able to classify the different household densities

# Generate predictions for the test dataset using the fitted log reg model
logreg_predictions <- predict(log_regmodel, newdata = reduced_training)

Step 3: Accuracy measures

Finally, look at the accuracy of the logistic regression model by summing across the diagonal of the confusion matrix.

# Create a confusion matrix
conf_matrix <- table(Predicted = logreg_predictions, Actual = reduced_training$bg_group)

# Calculate overall accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4)))
[1] "Accuracy: 0.3368"

As you can see, the low accuracy indicates that PCA might not be the best approach for the logistic regression model.

Step 4: Linearity Assumption

Now we aim to find out why the logistic regression model is failing. Since logistic regression models the log-odds of the outcome as a linear combination of the predictors, the model assumes a linear relationship between these items. If the relationship between the predictors and the log-odds is non-linear, the logistic regression model may not perform well, which is demonstrated in the case of this dataset. Let’s check the linearity assumptions to further show this.

# Extract non PCA training data
training_logreg <- training(partitions)

# Select sub sample of data to test assumption on
training_sample <- training_logreg %>%
  sample_n(1000)

# Create binary response variable
training_sample$bg_group_binary <- ifelse(training_sample$bg_group == "1", 1, 0)

# Fit binary logistic regression on one predictor
logreg_binary <- glm(bg_group_binary ~ WorkDaysWk, data = training_sample, family = binomial)

# Predict log odds for each observation using link = 'logit' argument
predicted_logodds <- predict(logreg_binary, type = "link",  link = "logit")

# Plot relationship between predictor and predicted log odds of outcome
nonlinear_plot <- ggplot(training_sample, aes(x = WorkDaysWk, y = predicted_logodds)) + 
    geom_point() + 
    geom_smooth(method = "loess")  
nonlinear_plot

From observing the plot above, it is clear that the relationship between the predicted log odds and a predictor, WorkDaysWk, is non-linear. We use the non-PCA data to test the linearity assumption because the principal components no longer represent the original features in a directly interpretable way. If we want to see if each predictor has a linear relationship with the log odds of the outcome, we use the non transformed data. In addition, we simplify the problem to a binary classification so that we can focus on how the relationship between a predictor and the log-odds of that outcome behaves, giving us a clearer, more interpretable plot for testing the linearity assumption.

Now that we have concluded that this dataset fails the linearity assumption, we know that it would be best modeled by other classification models, like the random forest model, which is what we will fit next.

Random Forest

A random forest is a collection of decision trees where the output is determined by majority voting across the different trees. It can handle non-linear data and helps us understand complex patterns in data. It can also tell us which variables are most important to determining the response variables, which helps us understand the data.

To implement this, we will be using k-fold cross-validation, which will take the training data, split it into k subsets, and alternate one subset as the validation fold and the rest as the training. This will help us build a better model and make sure it works well with unseen data.

We will also tune the random forest model parameters to ensure we get the most accurate model. The parameters include: mtry which is the number of variables randomly sampled at each split, trees which is the number of trees in the random forest, and min_n which is the minimum number of observations in each node.

Step 1: Set up the data for Random Forest modeling

For Random Forest modeling, we want to use the original clean dataset, from before preprocessing for PCA. However, there is a lot of missing data that we need to take a look at:

#look at missing data
colSums(is.na(data))
              hhid               Male                Age           persHisp 
                 0                  0                  0                  0 
         persWhite      persAfricanAm       persNativeAm          persAsian 
                 0                  0                  0                  0 
        persPacIsl           persOthr         persDKrace         persRFrace 
                 0                  0                  0                  0 
           bornUSA          DriverLic        TransitPass           Employed 
                 0                  0                  0                  0 
      WorkFixedLoc           WorkHome       WorkNonfixed         WorkDaysWk 
                 0                  0                  0                  0 
    TypicalHoursWk          FlexSched       FlexPrograms         Disability 
                 0                115                127                  0 
     DisLicensePlt     TransitTripsWk        WalkTripsWk        BikeTripsWk 
             28837                  0                  0                  0 
           Student           WorkMode         SchoolMode     EducationCompl 
                 0               4418              27376                  0 
           workday          Walk_Dist          Bike_Dist    DriveAlone_Dist 
                 0                  0                  0                  0 
  Driveothers_Dist     Passenger_Dist         Plane_Dist     Allothers_Dist 
                 0                  0                  0                  0 
        Walk_trips         Bike_trips   DriveAlone_trips  Driveothers_trips 
                 0                  0                  0                  0 
   Passenger_trips        Plane_trips    Allothers_trips          Sum_Trips 
                 0                  0                  0                  0 
           Sum_PMT              CTFIP             County                MPO 
                 0                  0                  0                  0 
              City                DOW            HH_size          HH_nTrips 
                 0                  0                  0                  0 
     HH_nEmployees       HH_nStudents       HH_nLicenses           HH_nCars 
                 0                  0                  0                  0 
         HH_nBikes          HH_income HH_anyTransitRider        HH_homeType 
                 0                  0                  0                  0 
      HH_homeowner      HH_isHispanic      HH_intEnglish         bg_density 
                 0                  0                  0                  0 
          bg_group 
                 0 

From the table, you will see there is a high amount of missing data in DisLicensePlt (if the respondent has a disabled license plate) and SchoolMode (how the respondent travels to school). We will impute these missing values with ‘Not Disabled’, and ‘Not in School’ respectively. We will also remove any location related variables, as those are deirectly correlated to bg_group and remove any other rows with missing values.

# Set seed for reproducibility
set.seed(14531)

#clean data
rf_data <- data %>% 
  select(-County, -CTFIP, -MPO, -City, -bg_density, -hhid) %>% 
  mutate(DisLicensePlt = as.factor(ifelse(is.na(DisLicensePlt), 'Not Disabled', DisLicensePlt)),
  SchoolMode = as.factor(ifelse(is.na(SchoolMode), 'Not in School', SchoolMode))) %>% 
  na.omit()

Step 2: Data Partitioning

Next we want to split the data into training and testing sets. We will use 80% of the data for training and 20% for testing.

# Set seed for reproducibility
set.seed(14531)

# Split the dataset
partitions <- initial_split(rf_data, prop = 0.8)

# Create training and testing datasets
train_data <- training(partitions)
test_data <- testing(partitions)

Step 3: Define the model

Now, we want to specify what kind of model we want to run with rand_forest(). We want to set the parameters, mtry, trees, and min_n, to tune(), so that we can later use cross-validation to test which values is the most optimal to set the parameters to. We will use the ranger engine since it runs the quickest and set the mode to classification.

# Specify the random forest model
rf_spec <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()
) %>% 
  set_engine("ranger",importance = "impurity") %>% 
  set_mode("classification")

Step 4: Define the workflow

Next, we set up the workflow and add in the model we defined previously. We will also add the formula, with bg_group as the response variable and all other variables in the dataset as predictors.

# Define a workflow
rf_workflow <- workflow() %>% 
  add_model(rf_spec) %>% 
  add_formula(bg_group ~ .)

Step 5: Set up cross-validation

Now we want to set up cross-validation with 5 folds. We also want to make sure we stratify the data by bg_group to ensure that each fold has an equal distribution of the different household densities. Then, we also set up a grid to test different combinations of the parameters mtry, trees, and min_n with 5 levels.

# Set up cross-validation
set.seed(14531)
train_folds <- vfold_cv(train_data, v = 5, strata = bg_group)

# Set up a grid to test parameter combinations
rf_grid <- grid_regular(mtry(range = c(5, 30)), 
                        trees(range = c(100, 500)),
                        min_n(range = c(5, 10)),
                        levels = 5)

Step 6: Tune the Model

We will tune the model on the grid we set up with the cross validation folds. Running a random forest model can take a long time. We want to make sure to save our tune results to an RData file so that we don’t have to continuously run the model tuning and take up time.

set.seed(14531)
# Tune the model
tune_result <- tune_grid(
  rf_workflow,
  resamples = train_folds,
  grid = rf_grid
)

#Save the results in RData File
save(tune_result, file = "data/tune_result.rda")

Step 7: Finalize the model

Now that we looked through the different parameter combinations, we want to extract the parameters that led to the most accurate model and finalize the model with those parameters. We load back in our RData file with the model tuning results and use the select_best function to get the most accurate model. The best parameters can be seen below:

#Load back in RData file
load("data/processed/tune_result.rda")
# Extract the best parameters
best_params <- select_best(tune_result, metric = "accuracy")
print(best_params)
# A tibble: 1 × 4
   mtry trees min_n .config               
  <int> <int> <int> <chr>                 
1     5   457     5 Preprocessor1_Model041

Then we want to finalize the model with the best parameters and train that model on the entire training data. We will then evaluate the model on the test data and get the predictions.

# Finalize the workflow with the best parameters
final_rf <- finalize_workflow(rf_workflow, best_params)

# Train the final model on the entire training data
final_rf_fit <- fit(final_rf, data = train_data)

# Evaluate the model on the test data
predictions <- predict(final_rf_fit, test_data) %>% bind_cols(test_data)

head(predictions) %>% kable()
.pred_class Male Age persHisp persWhite persAfricanAm persNativeAm persAsian persPacIsl persOthr persDKrace persRFrace bornUSA DriverLic TransitPass Employed WorkFixedLoc WorkHome WorkNonfixed WorkDaysWk TypicalHoursWk FlexSched FlexPrograms Disability DisLicensePlt TransitTripsWk WalkTripsWk BikeTripsWk Student WorkMode SchoolMode EducationCompl workday Walk_Dist Bike_Dist DriveAlone_Dist Driveothers_Dist Passenger_Dist Plane_Dist Allothers_Dist Walk_trips Bike_trips DriveAlone_trips Driveothers_trips Passenger_trips Plane_trips Allothers_trips Sum_Trips Sum_PMT DOW HH_size HH_nTrips HH_nEmployees HH_nStudents HH_nLicenses HH_nCars HH_nBikes HH_income HH_anyTransitRider HH_homeType HH_homeowner HH_isHispanic HH_intEnglish bg_group
Urban 1 67 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 3 40 0 0 0 Not Disabled 6 5 0 0 Driver Not in School Graduate degree 1 0.8467638 0 35.93732 0.00000 0.00000 0 0 2 0 2 0 0 0 0 4 36.78409 Wednesday 2 6 2 0 2 1 2 4 1 1 1 0 1 Suburban
Suburban 1 44 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 5 60 1 0 0 Not Disabled 1 3 1 0 Driver Not in School Some college 1 0.0000000 0 23.08985 0.00000 0.00000 0 0 0 0 2 0 0 0 0 2 23.08985 Tuesday 3 8 2 1 2 2 1 10 0 1 1 0 1 Suburban
Suburban 0 30 1 1 0 0 0 0 0 0 0 8 0 0 1 1 0 0 6 40 0 0 0 Not Disabled 1 0 0 0 Passenger Not in School Some college 1 0.0000000 0 0.00000 0.00000 0.00000 0 0 0 0 0 0 0 0 0 0 0.00000 Wednesday 5 8 2 2 0 0 1 1 1 2 0 1 0 Urban
Urban 1 54 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 2 12 0 0 0 Not Disabled 0 6 0 0 Driver Not in School Graduate degree 1 0.0000000 0 78.57123 0.00000 0.00000 0 0 0 0 5 0 0 0 0 5 78.57123 Wednesday 2 11 2 0 2 2 2 6 1 2 1 1 1 Urban
Suburban 0 47 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 5 40 0 0 0 Not Disabled 0 5 0 0 Driver Not in School Associate degree 1 0.0000000 0 0.00000 0.00000 18.02676 0 0 0 0 0 0 2 0 0 2 18.02676 Friday 2 4 2 0 2 2 2 5 0 1 1 1 1 Exurban
Suburban 0 42 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 5 30 0 0 0 Not Disabled 18 7 0 0 Driver Not in School Some college 1 0.0000000 0 0.00000 56.11904 0.00000 0 0 0 0 0 10 0 0 0 10 56.11904 Tuesday 5 29 2 3 2 2 5 7 1 1 1 0 1 Rural

Step 8: Evaluate the model

Using a confusion matrix, we can see how many predictions the model got right and wrong for each bg_group level. We found that the model is best at predicting the Suburban group and the the Urban group. We can also calculate the accuracy metric, which we got 0.447 for this model.

# Confusion matrix
confusion_matrix <- predictions %>% 
  conf_mat(truth = bg_group, estimate = .pred_class)
print(confusion_matrix)
          Truth
Prediction Urban Suburban Exurban Rural
  Urban      696      300     175    85
  Suburban   485      876     545   332
  Exurban    113      204     349   173
  Rural       64      149     136   308
# Calculate accuracy
accuracy <- predictions %>% 
  metrics(truth = bg_group, estimate = .pred_class) %>% 
  filter(.metric == "accuracy")
print(accuracy)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.447

We also want to take a look at the top 10 most important variables in influencing the model’s predictions. This most important variables was the total number of miles the person traveled on the survey data, followed by age and the number of trips the household takes.

# Plot variable importance with top 10 important variables
final_rf_fit %>% 
  vip() +
  theme_minimal()

According to the variance important plot, the most important variables, in order, are Sum_PMT, Age, HH_nTrips, DriveAlone_Dist, WalkTripsWk, HH_income, TypicalHoursWk, DOW, HH_nBikes, and Sum_Trips. However, we had an accuracy of 44.3%, so this plot might not be representative of true feature importance.

We used Random Forest to build a classification model aimed at predicting the bg_group based on various features. We applied k-fold cross-validation to fine-tune the model’s parameters, including mtry, trees, and min_n, to identify the most effective combination. The parameter tuning was performed using a grid search, ensuring that we explored different combinations to optimize performance. While the model showed a reasonably good ability to identify the most important variables contributing to predictions—such as total miles traveled, age, and the number of household trips—the overall accuracy on the test set was lower than expected (0.447). This is in stark contrast to the numeric-only model that yielded an impressive accuracy of 0.97.

This discrepancy suggests that the inclusion of categorical variables, while enriching the model with more detailed features, may have introduced complexity that lowered predictive performance. Potential factors include the handling of missing data, the level of feature interactions, and how the model managed high-cardinality categorical variables.

Conclusion

In conclusion, this project aimed to classify households into four classes: urban, suburban, exurban, and rural, based on a number of predictors that provided information about each household such as the total miles traveled, total hours worked per week, and total distance walked. Using these 70 predictors, principal component analysis was performed to reduce the dimensionality of the dataset and ultimately 12 principal components were retained, explaining 80% of the cumulative variance. Following PCA, a multinomial logistic regression model was fit which attempted to predict bg_group with the 12 principal components. Ultimately, this model failed, resulting in around 33.7% accuracy. This result indicates that the assumptions made for a logistic regression model, like the linearity assumption, failed to hold for this dataset, which was proven through visualizing the non-linear relationship between the log-odds of the outcome and a predictor. This conclusion lead us to explore new classification models like the random forest model. For this model, the hyperparameters trees, m_try, min_n were all tuned, and cross validation was employed to produce a more robust model. These methods produced an accuracy of 44.7%, which proves that this multinomial classification problem was a better fit for decision tree models like random forest, but there is still room for improvement.

Looking forward with this dataset, there are other models that could classify the observations into the four household density categories more accurately, such as support vector machines, gradient boosted machines, or neural networks. These models could handle the more complex data and patterns that simpler models might not perform as well on. But if we wanted to continue with the random forest model, then more hyperparameter tuning could result in higher accuracy as well. Another aspect of this dataset that could lead to better classification could be to utilize the geospatial features of this dataset. These were explored in the EDA, but could be very helpful predictors for a model to use in the future with this dataset. Overall, by fitting two classification models to the CHTS dataset, there was a lot of important exploration uncovered about the patterns of this data.